Background

This module builds on code contained in Coronavirus_Statistics_USAF_v006.Rmd. This file includes the latest code for analyzing data from USA Facts. USA Facts maintains data on cases and deaths by county for coronavirus in the US. Downloaded data are unique by county with date as a column and a separate file for each of cases, deaths, and population.

The intent of this code is to source updated functions that allow for readRunUSAFacts() to be run to obtain, read, process, and analyze data from USA Facts.

Sourcing Functions

The tidyverse library is loaded, and the functions used for CDC daily processing are sourced. Additionally, specific functions for USA Facts are also sourced:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Functions are available in source file
source("./Generic_Added_Utility_Functions_202105_v001.R")
source("./Coronavirus_CDC_Daily_Functions_v002.R")
source("./Coronavirus_USAF_Functions_v001.R")

Further, the mapping file specific to USA Facts is sourced:

source("./Coronavirus_USAF_Default_Mappings_v002.R")

Updated functions for diagnoseClusters(), createDetailedSummaries(), createSummary(), and helperSummaryMap() are included in Coronavirus_USAF_Functions_v001.R. These functions should be checked for consistency with state-level data with just a single copy kept later.

Example Process

The latest county-level burden data are downloaded:

readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20220913.csv", 
                 "usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20220913.csv"
                 )
compareList <- list("usafCase"=readFromRDS("cty_newdata_20220807")$dfRaw$usafCase, 
                    "usafDeath"=readFromRDS("cty_newdata_20220807")$dfRaw$usafDeath
                    )

# Use existing clusters
cty_newdata_20220913 <- readRunUSAFacts(maxDate="2022-09-11", 
                                        downloadTo=lapply(readList, 
                                                          FUN=function(x) if(file.exists(x)) NA else x
                                                          ),
                                        readFrom=readList, 
                                        compareFile=compareList, 
                                        writeLog="./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log", 
                                        ovrwriteLog=TRUE,
                                        useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
                                        skipAssessmentPlots=FALSE,
                                        brewPalette="Paired"
                                        )
## 
## No file has been downloaded, will use existing file: ./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20220913.csv
## Rows: 3193 Columns: 962
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): County Name, State, StateFIPS
## dbl (959): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 33
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 128 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## Rows: 3193 Columns: 962
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): County Name, State, StateFIPS
## dbl (959): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 33
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 3 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log

## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 108 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20220913_chk_v005.log
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType    cases     new_cases            n
##   <chr>     <dbl>         <dbl>        <dbl>
## 1 before 3.45e+10 93305986      3058894     
## 2 after  3.43e+10 91158859      3010036     
## 3 pctchg 7.16e- 3        0.0230       0.0160
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType  deaths   new_deaths            n
##   <chr>    <dbl>        <dbl>        <dbl>
## 1 before 5.12e+8 1034939      3058894     
## 2 after  4.95e+8  964043      3010036     
## 3 pctchg 3.23e-2       0.0685       0.0160

## NULL

# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20220913$useClusters, 
                       caption="Includes only counties with 25k+ population",
                       brewPalette="viridis"
                       )

# Save the refreshed file
saveToRDS(cty_newdata_20220913, ovrWriteError=FALSE)

Vaccines data are also updated, though the process needs to integrate previous data:

cty_vaxdata_20220914 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20220914.csv", 
                                              ctyList=cty_newdata_20220913, 
                                              minDateCD=c("2022-06-09", "2022-06-09"),
                                              maxDateCD="2022-09-01"
                                              )
## Rows: 78961 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
##   state     n
##   <chr> <int>
## 1 AS       24
## 2 FM       25
## 3 GU       48
## 4 MH       24
## 5 MP       24
## 6 PR     1901
## 7 PW       24
## 8 VI       96
## 9 <NA>     17

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## 
## Count of NA records by column
##           state            FIPS popgte65_minpop popgte65_maxpop    popgte65_nnA 
##               0               0               0               0               0 
##               n 
##               0 
## 
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## #   maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
## 
## 
## 
## Will run with parameters:
## burdenVar: cpm dpm 
## vaxVar: vxcpoppct vxcpoppct 
## minDateCD: 2022-06-09 2022-06-09 
## maxDateCD: 2022-09-01 2022-09-01
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -289154037   -1505413     -90865    1548407   62031339 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20807.226   2187.230   9.513   <2e-16 ***
## vaxMetric       8.061     33.809   0.238    0.812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7367000 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  1.82e-05,   Adjusted R-squared:  -0.0003019 
## F-statistic: 0.05684 on 1 and 3124 DF,  p-value: 0.8116
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -288389946   -1463009     -64586    1597997   65055859 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                12176.47    7942.37   1.533 0.125352    
## type>500k               29162.62    4682.01   6.229 5.34e-10 ***
## type100k-500k           17769.84    4666.92   3.808 0.000143 ***
## type25k-100k            17521.42    5263.39   3.329 0.000882 ***
## vaxMetric:type<25k        175.75     159.92   1.099 0.271861    
## vaxMetric:type>500k      -110.85      66.36  -1.670 0.094933 .  
## vaxMetric:type100k-500k    60.38      74.96   0.805 0.420630    
## vaxMetric:type25k-100k     66.14      99.33   0.666 0.505574    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7368000 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.4677, Adjusted R-squared:  0.4663 
## F-statistic: 342.4 on 8 and 3118 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3529509   -12224    -2495    14425   697276 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 298.9805    29.5946  10.103   <2e-16 ***
## vaxMetric    -4.0106     0.4575  -8.767   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99680 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.02401,    Adjusted R-squared:  0.0237 
## F-statistic: 76.86 on 1 and 3124 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3480323   -13174    -5106    10332   707402 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                130.2314   107.0422   1.217  0.22384    
## type>500k               383.7067    63.1011   6.081 1.34e-09 ***
## type100k-500k            99.6779    62.8978   1.585  0.11312    
## type25k-100k            207.0272    70.9366   2.918  0.00354 ** 
## vaxMetric:type<25k       -0.3702     2.1553  -0.172  0.86365    
## vaxMetric:type>500k      -5.4661     0.8943  -6.112 1.11e-09 ***
## vaxMetric:type100k-500k  -0.5040     1.0103  -0.499  0.61791    
## vaxMetric:type25k-100k   -1.8241     1.3388  -1.362  0.17314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99300 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.05224,    Adjusted R-squared:  0.04981 
## F-statistic: 21.48 on 8 and 3118 DF,  p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20220914, ovrWriteError=FALSE)

County-level data are post-processed:

cty_postdata_20220913 <- postProcessCountyData(lstCtyBurden=cty_newdata_20220913$dfPerCapita, 
                                               lstCtyVax=cty_vaxdata_20220914$vaxFix, 
                                               lstState=readFromRDS("cdc_daily_220902")$dfPerCapita
                                               )
## 
## Parameter maxDate is: 2022-09-01

# Save the refreshed file
saveToRDS(cty_postdata_20220913, ovrWriteError=FALSE)

Additional post-processing steps are run:

# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20220913, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).

# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20220913, 
                            p1CompareStates=c("GA", "FL", "NE"), 
                            p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"), 
                            p3VaxTimes=sort(c("2022-01-01", "2022-08-31")),
                            p4DF=cty_newdata_20220913$dfPerCapita
                            )
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 1 row(s) containing missing values (geom_path).

Additional plots are also updated:

# Creating the vaccine and burden data
tmpVaxBurden <- createVaxBurdenData(lstVax=cty_vaxdata_20220914, lstBurden=cty_newdata_20220913)

# Nationwide aggregation, excluding problem states
problemStates <- c("VA", "TX", "SD", "HI", "GA", "CO", "GA", "FL", "NE", "VA")
useStates <- state.abb
tmpCountyList <- tmpVaxBurden$ctyPop %>%
    filter(state %in% useStates, !(state %in% problemStates)) %>%
    left_join(filter(tmpVaxBurden$dfVaxBurden, name=="vxcpoppct", date==max(date)), by="countyFIPS") %>%
    mutate(vaxPct=percent_rank(value), 
           vaxBucket=case_when(vaxPct <= .25 ~ "3. Low", vaxPct >= .75 ~ "1. High", TRUE ~ "2. Medium")
           ) %>%
    split(f=.$vaxBucket)

# Plot of absolute burden
plotVaxBurdenData(tmpVaxBurden, 
                  ctyPlot=lapply(tmpCountyList, FUN=function(x) x %>% rename(bucket=vaxBucket)), 
                  plotTitle="Counties in states with continuous county data"
                  )

# Plots of burden relative to May 2021
plotVaxBurdenData(tmpVaxBurden, 
                  ctyPlot=lapply(tmpCountyList, FUN=function(x) x %>% rename(bucket=vaxBucket)), 
                  plotTitle="Counties in states with continuous county data", 
                  scaleToDate="2021-05-01"
                  )

Multiple data anomalies need to be addressed - vaccines data covering only a short time period, counties significantly revising burden downwards, etc.

Counties with significant negative burden are investigated:

keyRatio <- cty_newdata_20220913$dfPerCapita %>%
    select(countyFIPS, state, date, cases, deaths) %>%
    group_by(countyFIPS, state) %>%
    summarize(keyRatDeath=sum(ifelse(date==max(date), deaths, 0)/max(deaths)), 
              keyRatCases=sum(ifelse(date==max(date), cases, 0)/max(cases)),
              .groups="drop"
              )

keyRatio %>%
    mutate(across(where(is.numeric), .fns=function(x) round(x, 3))) %>%
    count(keyRatCases, keyRatDeath) %>%
    ggplot(aes(x=keyRatCases, y=keyRatDeath)) + 
    geom_point(aes(size=n)) + 
    lims(x=c(0, 1), y=c(0, 1)) + 
    labs(x="Latest cases vs. maximum cases", 
         y="Latest deaths vs. maximum deaths", 
         title="County-level restatement"
         )
## Warning: Removed 6 rows containing missing values (geom_point).

keyRatio %>%
    mutate(across(where(is.numeric), .fns=function(x) round(x, 3))) %>%
    count(keyRatCases, keyRatDeath) %>%
    filter(keyRatCases < 1 | keyRatDeath < 1) %>%
    ggplot(aes(x=keyRatCases, y=keyRatDeath)) + 
    geom_point(aes(size=n)) + 
    lims(x=c(0, 1), y=c(0, 1)) + 
    labs(x="Latest cases vs. maximum cases", 
         y="Latest deaths vs. maximum deaths", 
         title="County-level restatement", 
         subtitle="Only counties with at least one ratio .999 or lower included"
         )
## Warning: Removed 4 rows containing missing values (geom_point).

Restatement of deaths is much more common than restatement of cases. Counties with declines are further explored:

decStatus <- keyRatio %>%
    mutate(rd=case_when(is.na(keyRatDeath) ~ -1, 
                        keyRatDeath==1 ~ 1, 
                        keyRatDeath>=.95 ~ .95, 
                        keyRatDeath>=.9 ~ .9, 
                        keyRatDeath>=.75 ~ .75, 
                        keyRatDeath>=.5 ~ .5, 
                        TRUE ~ 0
                        )
           )
decStatus %>%
    count(rd)
## # A tibble: 7 × 2
##      rd     n
##   <dbl> <int>
## 1 -1       20
## 2  0       17
## 3  0.5     23
## 4  0.75    61
## 5  0.9     34
## 6  0.95   118
## 7  1     2869
decStatus %>%
    count(rd, state) %>%
    filter(rd<.95, rd>=0, n>1) %>%
    arrange(rd, -n)
## # A tibble: 18 × 3
##       rd state     n
##    <dbl> <chr> <int>
##  1  0    NE        4
##  2  0    AK        3
##  3  0    TX        3
##  4  0    MA        2
##  5  0.5  IL       10
##  6  0.5  NE        9
##  7  0.5  AK        2
##  8  0.75 IL       37
##  9  0.75 MA       11
## 10  0.75 NE        4
## 11  0.75 KS        3
## 12  0.75 VA        2
## 13  0.9  IL       17
## 14  0.9  NE        4
## 15  0.9  CA        2
## 16  0.9  CO        2
## 17  0.9  FL        2
## 18  0.9  VA        2
cty_newdata_20220913$dfPerCapita %>%
    left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
    group_by(rd, date) %>%
    summarize(cases=sum(cases), deaths=sum(deaths), n=n(), .groups="drop") %>%
    mutate(labFacet=paste0(rd, " (n=", n, ")")) %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line() + 
    facet_wrap(~labFacet, scales="free_y") + 
    labs(x=NULL, 
         y="Reported deaths", 
         title="Reported deaths, facetted by change from maximum"
         )

Illinois and Nebraska appear to be main drivers of reported declines (discontinuities) in deaths. Reporting is potentially lagged, as much of the issue appears to be recent.

Restatement of cases is also explored:

decStatus <- keyRatio %>%
    mutate(rd=case_when(is.na(keyRatCases) ~ -1, 
                        keyRatCases==1 ~ 1, 
                        keyRatCases>=.99 ~ .99, 
                        keyRatCases>=.95 ~ .95, 
                        keyRatCases>=.5 ~ .5, 
                        TRUE ~ 0
                        )
           )
decStatus %>%
    count(rd)
## # A tibble: 6 × 2
##      rd     n
##   <dbl> <int>
## 1 -1        2
## 2  0        4
## 3  0.5     10
## 4  0.95    16
## 5  0.99    69
## 6  1     3041
decStatus %>%
    count(rd, state) %>%
    filter(rd<.99, rd>=0, n>1) %>%
    arrange(rd, -n)
## # A tibble: 5 × 3
##      rd state     n
##   <dbl> <chr> <int>
## 1  0    TX        3
## 2  0.5  NV        4
## 3  0.5  UT        2
## 4  0.95 NE        6
## 5  0.95 VA        6
cty_newdata_20220913$dfPerCapita %>%
    left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
    group_by(rd, date) %>%
    summarize(cases=sum(cases), deaths=sum(deaths), n=n(), .groups="drop") %>%
    mutate(labFacet=paste0(rd, " (n=", n, ")")) %>%
    ggplot(aes(x=date, y=cases)) + 
    geom_line() + 
    facet_wrap(~labFacet, scales="free_y") + 
    labs(x=NULL, 
         y="Reported cases", 
         title="Reported cases, facetted by change from maximum"
         )

In aggregate, the cases look OK, with the exception of a few counties that may have incomplete reporting in the most recent time period. Specific county declines are further explored:

decStatus <- keyRatio %>%
    mutate(rd=case_when(is.na(keyRatDeath) ~ -1, 
                        keyRatDeath==1 ~ 1, 
                        keyRatDeath>=.95 ~ .95, 
                        keyRatDeath>=.9 ~ .9, 
                        keyRatDeath>=.75 ~ .75, 
                        keyRatDeath>=.5 ~ .5, 
                        TRUE ~ 0
                        )
           )
decStatus %>%
    count(rd)
## # A tibble: 7 × 2
##      rd     n
##   <dbl> <int>
## 1 -1       20
## 2  0       17
## 3  0.5     23
## 4  0.75    61
## 5  0.9     34
## 6  0.95   118
## 7  1     2869
decStatus %>%
    count(rd, state) %>%
    filter(rd<.95, rd>=0, n>1) %>%
    arrange(rd, -n)
## # A tibble: 18 × 3
##       rd state     n
##    <dbl> <chr> <int>
##  1  0    NE        4
##  2  0    AK        3
##  3  0    TX        3
##  4  0    MA        2
##  5  0.5  IL       10
##  6  0.5  NE        9
##  7  0.5  AK        2
##  8  0.75 IL       37
##  9  0.75 MA       11
## 10  0.75 NE        4
## 11  0.75 KS        3
## 12  0.75 VA        2
## 13  0.9  IL       17
## 14  0.9  NE        4
## 15  0.9  CA        2
## 16  0.9  CO        2
## 17  0.9  FL        2
## 18  0.9  VA        2
cty_newdata_20220913$dfPerCapita %>%
    left_join(select(decStatus, countyFIPS, state, rd), by=c("countyFIPS", "state")) %>%
    filter(rd > 0, rd < 0.9) %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line(aes(group=countyFIPS, color=state)) + 
    facet_wrap(~rd, scales="free_y") + 
    labs(x=NULL, 
         y="Reported deaths by county", 
         title="Reported deaths, facetted by change from maximum"
         )

There are two separate issues - some counties appear to have incomplete data in the latest time period, while other counties appear to have significant negative restatement of data earlier in 2022

The ratio process is updated:

keyRatioDate <- cty_newdata_20220913$dfPerCapita %>%
    select(countyFIPS, state, date, cases, deaths) %>%
    pivot_longer(-c(countyFIPS, state, date)) %>%
    arrange(countyFIPS, state, name, date) %>%
    group_by(countyFIPS, state, name) %>%
    mutate(ratMax=value/max(value, na.rm=TRUE), cumMax=value/cummax(value)) %>%
    ungroup()
keyRatioDate
## # A tibble: 6,020,072 × 7
##    countyFIPS state date       name  value ratMax cumMax
##    <chr>      <chr> <date>     <chr> <dbl>  <dbl>  <dbl>
##  1 01001      AL    2020-01-22 cases     0      0    NaN
##  2 01001      AL    2020-01-23 cases     0      0    NaN
##  3 01001      AL    2020-01-24 cases     0      0    NaN
##  4 01001      AL    2020-01-25 cases     0      0    NaN
##  5 01001      AL    2020-01-26 cases     0      0    NaN
##  6 01001      AL    2020-01-27 cases     0      0    NaN
##  7 01001      AL    2020-01-28 cases     0      0    NaN
##  8 01001      AL    2020-01-29 cases     0      0    NaN
##  9 01001      AL    2020-01-30 cases     0      0    NaN
## 10 01001      AL    2020-01-31 cases     0      0    NaN
## # … with 6,020,062 more rows
## # ℹ Use `print(n = ...)` to see more rows
dfMinMax <- keyRatioDate %>%
    filter(!is.na(cumMax)) %>%
    group_by(countyFIPS, name) %>%
    summarize(minMax=min(cumMax), .groups="drop") 
dfMinMax %>%
    ggplot(aes(x=minMax)) + 
    geom_density(aes(group=name, color=name)) + 
    labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)", 
         subtitle="Lowest value per county and metric plotted", 
         x="Lowest value of daily burden vs. cumulative maximum of burden", 
         y="Density"
         ) + 
    scale_color_discrete("Metric")

dfMinMax %>%
    filter(minMax == 0, name=="deaths") %>%
    select(countyFIPS) %>%
    left_join(cty_newdata_20220913$dfPerCapita, by="countyFIPS") %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line(aes(group=countyFIPS)) + 
    facet_wrap(~countyFIPS, scales="free_y") + 
    labs(title="Reported deaths by counties with zero following non-zero", x=NULL, y="Reported deaths")

While some of the declines are anomalous, others appear to be curves that are either very low volume or smooth on a rolling-7 basis

The process is repeated to examine issues by state:

keyRatioDateState <- cty_newdata_20220913$dfPerCapita %>%
    group_by(state, date) %>%
    summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
    pivot_longer(-c(state, date)) %>%
    arrange(state, name, date) %>%
    group_by(state, name) %>%
    mutate(ratMax=value/max(value, na.rm=TRUE), cumMax=ifelse(cummax(value)==0, 1, value/cummax(value))) %>%
    ungroup()
keyRatioDateState
## # A tibble: 97,716 × 6
##    state date       name  value ratMax cumMax
##    <chr> <date>     <chr> <dbl>  <dbl>  <dbl>
##  1 AK    2020-01-22 cases     0      0      1
##  2 AK    2020-01-23 cases     0      0      1
##  3 AK    2020-01-24 cases     0      0      1
##  4 AK    2020-01-25 cases     0      0      1
##  5 AK    2020-01-26 cases     0      0      1
##  6 AK    2020-01-27 cases     0      0      1
##  7 AK    2020-01-28 cases     0      0      1
##  8 AK    2020-01-29 cases     0      0      1
##  9 AK    2020-01-30 cases     0      0      1
## 10 AK    2020-01-31 cases     0      0      1
## # … with 97,706 more rows
## # ℹ Use `print(n = ...)` to see more rows
dfMinMaxState <- keyRatioDateState %>%
    filter(!is.na(cumMax)) %>%
    group_by(state, name) %>%
    summarize(minMax=min(cumMax), .groups="drop") 
dfMinMaxState %>%
    ggplot(aes(x=minMax)) + 
    geom_density(aes(group=name, color=name)) + 
    labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)", 
         subtitle="Lowest value per state and metric plotted", 
         x="Lowest value of daily burden vs. cumulative maximum of burden", 
         y="Density"
         ) + 
    scale_color_discrete("Metric")

dfMinMaxState %>%
    filter(minMax < 0.95, name=="deaths") %>%
    select(state) %>%
    left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
    group_by(state, date) %>%
    summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line(aes(group=state)) + 
    facet_wrap(~state, scales="free_y") + 
    labs(title="Reported deaths by states with meaningfully non-ascending trend", x=NULL, y="Reported deaths")

Significant restatements appear to be in MA, while missing recent data appears to be in IL. It is unclear if MO and NE are still reporting county-level deaths. The data is potentially less complete and accurate than in previous iterations

The definition of a decline is modified to be min(cur-cummax)/max, so that declines of 1 or 2 early in the data are not flagged as major percentage changes:

keyRatioDateState_v2 <- cty_newdata_20220913$dfPerCapita %>%
    group_by(state, date) %>%
    summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
    pivot_longer(-c(state, date)) %>%
    arrange(state, name, date) %>%
    group_by(state, name) %>%
    mutate(ratMax=value/max(value, na.rm=TRUE), 
           cumMax=ifelse(cummax(value)==0, 1, (value-cummax(value))/max(value, na.rm=TRUE))
           ) %>%
    ungroup()
keyRatioDateState_v2
## # A tibble: 97,716 × 6
##    state date       name  value ratMax cumMax
##    <chr> <date>     <chr> <dbl>  <dbl>  <dbl>
##  1 AK    2020-01-22 cases     0      0      1
##  2 AK    2020-01-23 cases     0      0      1
##  3 AK    2020-01-24 cases     0      0      1
##  4 AK    2020-01-25 cases     0      0      1
##  5 AK    2020-01-26 cases     0      0      1
##  6 AK    2020-01-27 cases     0      0      1
##  7 AK    2020-01-28 cases     0      0      1
##  8 AK    2020-01-29 cases     0      0      1
##  9 AK    2020-01-30 cases     0      0      1
## 10 AK    2020-01-31 cases     0      0      1
## # … with 97,706 more rows
## # ℹ Use `print(n = ...)` to see more rows
dfMinMaxState_v2 <- keyRatioDateState_v2 %>%
    filter(!is.na(cumMax)) %>%
    group_by(state, name) %>%
    summarize(minMax=min(cumMax), .groups="drop") 
dfMinMaxState_v2 %>%
    ggplot(aes(x=fct_reorder(state, -minMax, min), y=1+minMax)) + 
    geom_col(fill="lightblue") +
    geom_text(aes(label=round(1+minMax, 2)), hjust=0) +
    labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)", 
         subtitle="Lowest value of 1 + (value - cummax(value)) / max(value)", 
         y="Lowest value", 
         x=NULL
         ) + 
    coord_flip() +
    facet_wrap(~name)

dfMinMaxState_v2 %>%
    filter(state %in% c("IL", "TX", "MA", "NE")) %>%
    select(state) %>%
    left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
    group_by(state, date) %>%
    summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line(aes(group=state)) + 
    facet_wrap(~state, scales="free_y") + 
    labs(title="Reported deaths by states with meaningfully non-ascending trend", x=NULL, y="Reported deaths")

dfMinMaxState_v2 %>%
    filter(state %in% c("IL", "WY")) %>%
    select(state) %>%
    left_join(cty_newdata_20220913$dfPerCapita, by="state") %>%
    group_by(state, date) %>%
    summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
    ggplot(aes(x=date, y=cases)) + 
    geom_line(aes(group=state)) + 
    facet_wrap(~state, scales="free_y") + 
    labs(title="Reported cases by states with meaningfully non-ascending trend", x=NULL, y="Reported cases")

The updated methodology better flags states with meaningful restatement problems

The process is converted to functional form:

# Function to calculate distance from global maxima and previous maxima (including self)
findDeltaFromMax <- function(df, groupBy=c(), timeVar="date", numVars=NULL) {
    
    # FUNCTION ARGUMENTS:
    # df: a data frame
    # groupBy: levels to which the final data should be aggregated
    # timeVar: time variable to which data should be aggregated
    # numVars: numeric variables to be summarized (NULL means all numeric variables)
    
    # Find numVars if not provided
    if(is.null(numVars)) numVars <- df %>% select(where(is.numeric)) %>% names %>% setdiff(groupBy)
    
    df %>%
        group_by_at(all_of(c(groupBy, timeVar))) %>%
        summarize(across(all_of(numVars), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop") %>%
        pivot_longer(all_of(numVars)) %>%
        arrange(across(all_of(c(groupBy, "name", timeVar)))) %>%
        group_by_at(all_of(c(groupBy, "name"))) %>%
        mutate(ratMax=value/max(value, na.rm=TRUE), 
               cumMax=ifelse(cummax(value)==0, 1, value/cummax(value)), 
               delMax=ifelse(cummax(value)==0, 1, (value-cummax(value))/max(value, na.rm=TRUE))
               ) %>%
        ungroup()
    
}

# Check for states
dfTest <- findDeltaFromMax(cty_newdata_20220913$dfPerCapita, groupBy="state", numVar=c("cases", "deaths"))
identical(dfTest %>% select(-delMax), keyRatioDateState)
## [1] TRUE
identical(dfTest %>% select(-cumMax) %>% rename(cumMax=delMax), keyRatioDateState_v2)
## [1] TRUE
# Check for counties - old approach output NaN rather than 1 when cummax(value)=0
dfTest_v2 <- findDeltaFromMax(cty_newdata_20220913$dfPerCapita, 
                              groupBy=c("countyFIPS", "state"), 
                              numVar=c("cases", "deaths")
                              )
identical(dfTest_v2 %>% select(-delMax, -cumMax), keyRatioDate %>% select(-cumMax))
## [1] TRUE
all.equal(dfTest_v2 %>% pull(cumMax), 
          keyRatioDate %>% select(cumMax) %>% mutate(cumMax=ifelse(is.na(cumMax), 1, cumMax)) %>% pull(cumMax)
          )
## [1] TRUE

Functions are also written for plot creation:

plotDeltaFromMax <- function(df, 
                             dfCtyData=NULL,
                             plotStateRatio=FALSE, 
                             plotDeathStates=c(), 
                             plotCaseStates=c()
                             ) {
    
    # FUNCTION ARGUMENTS:
    # df: data frame or tibble containing ratios by entity and date
    # dfCtyData: county-level per-capita data (not needed if only plotStateRatio is run)
    # plotStateRatio: should the state ratio plot be created?
    # plotDeathStates: vector of states to plot on the deaths metric (c() means do not plot)
    # plotCaseStates: vector of states to plot on the cases metric (c() means do not plot)

    
    # Plot 1: state ratios
    if(isTRUE(plotStateRatio)) {
        p1 <- df %>%
            filter(!is.na(delMax)) %>%
            group_by(state, name) %>%
            summarize(minMax=min(delMax), .groups="drop") %>%
            ggplot(aes(x=fct_reorder(state, -minMax, min), y=1+minMax)) + 
            geom_col(fill="lightblue") +
            geom_text(aes(label=round(1+minMax, 2)), hjust=0) +
            labs(title="Ratio of burden vs. cumulative maximum of burden (expected to be 1 for ascending sequence)", 
                 subtitle="Lowest value of 1 + (value - cummax(value)) / max(value)", 
                 y="Lowest value", 
                 x=NULL
                 ) + 
            coord_flip() +
            facet_wrap(~name)
        print(p1)
    }
    
    # Plots 2 and 3: Create case and death data
    if((length(plotCaseStates) > 0) | (length(plotDeathStates) > 0)) {
        dfPlot <- dfCtyData %>%
            select(state, date, cases, deaths) %>%
            filter(state %in% all_of(union(plotCaseStates, plotDeathStates))) %>%
            group_by(state, date) %>%
            summarize(across(c(cases, deaths), .fns=function(x) sum(x, na.rm=TRUE)), .groups="drop")
    }
    
    # Plot 2: Death states
    if(length(plotDeathStates) > 0) {
        p2 <- dfPlot %>%
            filter(state %in% all_of(plotDeathStates)) %>%
            ggplot(aes(x=date, y=deaths)) + 
            geom_line(aes(group=state)) + 
            facet_wrap(~state, scales="free_y") + 
            labs(title="Reported deaths by states with meaningfully non-ascending trend", 
                 x=NULL, 
                 y="Reported deaths"
                 )
        print(p2)
    }
    
    # Plot 3: Case states
    if(length(plotCaseStates) > 0) {
        p3 <- dfPlot %>%
            filter(state %in% all_of(plotCaseStates)) %>%
            ggplot(aes(x=date, y=cases)) + 
            geom_line(aes(group=state)) + 
            facet_wrap(~state, scales="free_y") + 
            labs(title="Reported cases by states with meaningfully non-ascending trend", 
                 x=NULL, 
                 y="Reported cases"
                 )
        print(p3)
    }
    
}

plotDeltaFromMax(dfTest, plotStateRatio=TRUE)

plotDeltaFromMax(dfTest, 
                 dfCtyData=cty_newdata_20220913$dfPerCapita, 
                 plotDeathStates=c("IL", "TX", "MA", "NE"), 
                 plotCaseStates=c("IL", "WY")
                 )

Missing vaccines data are also explored:

tmpVaxCounts <- readFromRDS("cty_vaxdata_20220709")$vaxFix %>% 
    count(date, name="vax0709") %>%
    full_join(cty_vaxdata_20220914$vaxFix %>% 
                  count(date, name="vax0914"), 
              by="date"
              )

tmpVaxCounts %>%
    pivot_longer(-c(date)) %>%
    mutate(value=ifelse(is.na(value), 0, value)) %>%
    ggplot(aes(x=date, y=value)) + 
    geom_line(aes(group=name, color=name)) + 
    labs(title="# Counties reporting vaccines by date", x=NULL, y="# Counties reporting") + 
    scale_color_discrete("Source Date")

There is only a small overlapping section of data. Older data will need to be merged for a complete dataset. Consistency of reported vaccines is also checked:

tempGetVax <- function(df, groupBy=c("date")) {
    
    df %>%
        group_by_at(all_of(groupBy)) %>%
        summarize(vxc=sum(vxcpoppct*pop/100, na.rm=TRUE), 
                  vxcgte18=sum(vxcgte18pct*popgte18/100, na.rm=TRUE), 
                  vxcgte65=sum(vxcgte65pct*popgte65/100, na.rm=TRUE), 
                  .groups="drop"
                  )
    
}

tmpVaxSum <- tempGetVax(cty_vaxdata_20220914$vaxFix) %>%
    bind_rows(tempGetVax(readFromRDS("cty_vaxdata_20220709")$vaxFix), .id="src")

tmpVaxSum %>%
    mutate(src=c("1"="14-SEP-22", "2"="09-JUL-22")[src]) %>%
    pivot_longer(-c(src, date)) %>%
    ggplot(aes(x=date, y=value)) + 
    geom_line(aes(group=src, color=src)) + 
    facet_wrap(~name) + 
    labs(title="Vaccinated by source, date, and age bucket", x=NULL, y="# Fully vaccinated") + 
    scale_color_discrete("Data From:")

Data volumes appear to be broadly consistent, further suggestive that pasting back missing historical data is reasonable.

Data Updates

The latest county-level burden data are downloaded:

readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221010.csv", 
                 "usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221010.csv"
                 )
compareList <- list("usafCase"=readFromRDS("cty_newdata_20220913")$dfRaw$usafCase, 
                    "usafDeath"=readFromRDS("cty_newdata_20220913")$dfRaw$usafDeath
                    )

# Use existing clusters
cty_newdata_20221010 <- readRunUSAFacts(maxDate="2022-10-08", 
                                        downloadTo=lapply(readList, 
                                                          FUN=function(x) if(file.exists(x)) NA else x
                                                          ),
                                        readFrom=readList, 
                                        compareFile=compareList, 
                                        writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log", 
                                        ovrwriteLog=TRUE,
                                        useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
                                        skipAssessmentPlots=FALSE,
                                        brewPalette="Paired"
                                        )
## Rows: 3193 Columns: 992
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): County Name, State, StateFIPS
## dbl (989): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 184 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## Rows: 3193 Columns: 992
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): County Name, State, StateFIPS
## dbl (989): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 30
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log

## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 66 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221010_chk_v005.log
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType    cases     new_cases            n
##   <chr>     <dbl>         <dbl>        <dbl>
## 1 before 3.73e+10 92832282      3154684     
## 2 after  3.70e+10 90648951      3104296     
## 3 pctchg 8.36e- 3        0.0235       0.0160
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType  deaths   new_deaths            n
##   <chr>    <dbl>        <dbl>        <dbl>
## 1 before 5.43e+8 1046659      3154684     
## 2 after  5.24e+8  972745      3104296     
## 3 pctchg 3.44e-2       0.0706       0.0160

## NULL

# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221010$useClusters, 
                       caption="Includes only counties with 25k+ population",
                       brewPalette="viridis"
                       )

# Save the refreshed file
saveToRDS(cty_newdata_20221010, ovrWriteError=FALSE)

Vaccines data are also updated, though the process will need to integrate previous data:

cty_vaxdata_20221011 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20221011.csv", 
                                              ctyList=cty_newdata_20221010, 
                                              minDateCD=c("2022-06-09", "2022-06-09"),
                                              maxDateCD="2022-09-29"
                                              )
## Rows: 446971 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
##   state     n
##   <chr> <int>
## 1 AS      136
## 2 FM      136
## 3 GU      272
## 4 MH      136
## 5 MP      136
## 6 PR    10756
## 7 PW      136
## 8 VI      545
## 9 <NA>     79

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## 
## Count of NA records by column
##           state            FIPS popgte65_minpop popgte65_maxpop    popgte65_nnA 
##               0               0               0               0               0 
##               n 
##               0 
## 
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## #   maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
## 
## 
## 
## Will run with parameters:
## burdenVar: cpm dpm 
## vaxVar: vxcpoppct vxcpoppct 
## minDateCD: 2022-06-09 2022-06-09 
## maxDateCD: 2022-09-29 2022-09-29
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -292485037   -1737398     -20921    1895138   67166643 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24796.753   2297.542  10.793   <2e-16 ***
## vaxMetric       9.448     35.514   0.266     0.79    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7738000 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  2.266e-05,  Adjusted R-squared:  -0.0002974 
## F-statistic: 0.07078 on 1 and 3124 DF,  p-value: 0.7902
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -291660235   -1751594     -20106    1897925   69792822 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                15606.10    8344.59   1.870   0.0615 .  
## type>500k               31568.15    4919.11   6.417 1.60e-10 ***
## type100k-500k           21288.00    4903.26   4.342 1.46e-05 ***
## type25k-100k            21739.66    5529.94   3.931 8.63e-05 ***
## vaxMetric:type<25k        194.27     168.02   1.156   0.2477    
## vaxMetric:type>500k       -88.16      69.72  -1.264   0.2062    
## vaxMetric:type100k-500k    67.89      78.76   0.862   0.3888    
## vaxMetric:type25k-100k     70.74     104.37   0.678   0.4979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7741000 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.5302, Adjusted R-squared:  0.529 
## F-statistic: 439.8 on 8 and 3118 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3549904   -14579    -2095    18933   693414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 352.2786    30.1110   11.70   <2e-16 ***
## vaxMetric    -4.4866     0.4654   -9.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101400 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.02888,    Adjusted R-squared:  0.02857 
## F-statistic: 92.92 on 1 and 3124 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3500417   -16774    -5431    14429   705700 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                178.8397   108.8961   1.642 0.100629    
## type>500k               413.3400    64.1940   6.439 1.39e-10 ***
## type100k-500k           147.3023    63.9871   2.302 0.021397 *  
## type25k-100k            254.8892    72.1652   3.532 0.000418 ***
## vaxMetric:type<25k       -0.4989     2.1926  -0.228 0.820011    
## vaxMetric:type>500k      -5.6202     0.9098  -6.177 7.36e-10 ***
## vaxMetric:type100k-500k  -0.9261     1.0278  -0.901 0.367661    
## vaxMetric:type25k-100k   -2.1172     1.3620  -1.555 0.120155    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101000 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.07933,    Adjusted R-squared:  0.07696 
## F-statistic: 33.58 on 8 and 3118 DF,  p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20221011, ovrWriteError=FALSE)

County-level data are post-processed:

cty_postdata_20221010 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221010$dfPerCapita, 
                                               lstCtyVax=cty_vaxdata_20221011$vaxFix, 
                                               lstState=readFromRDS("cdc_daily_221002")$dfPerCapita
                                               )
## 
## Parameter maxDate is: 2022-10-01

# Save the refreshed file
saveToRDS(cty_postdata_20221010, ovrWriteError=FALSE)

The Alaska portion of usmap::plot_usmap() has an issue with Valdex county. The function is updated to allow for excluding problematic states:

# Updated to allow for include and exclude
makeBurdenDatePlot <- function(df, 
                               keyVar,
                               timeLabel,
                               plotTitle=NULL,
                               varLabel=NULL,
                               varFloor=0,
                               varCeiling=+Inf,
                               varDivBy=1,
                               vecRename=c("countyFIPS"="fips"), 
                               includeStates=c(), 
                               excludeStates=c()
                               ) {
    
    # FUNCTION ARGUMENTS:
    # df: a processed data frame with fips, asofDate, burden
    # keyVar: character string for variable to be plotted
    # timeLabel: character string for amount of time (e.g., "1-month" or "5-week")
    # plotTitle: title for the plot (NULL means infer from other arguments)
    # varLabel: label for the variable in plot scale (NULL means infer from other arguments)
    # varFloor: minimum value to be allowed for variable (-Inf means no floor applied)
    # varCeiling: maximum value to be allowed for variable (Inf means no ceiling applied)
    # varDivBy: variable should be divivded by this for plotting
    # vecRename: renaming vector to get desired variables in frame
    # includeStates: specific states to be plotted, where c() means all states
    # excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
    
    # Create varLabel if passed as NULL
    if(is.null(varLabel)) {
        varLabel <- stringr::str_to_upper(stringr::str_extract(keyVar, "^[A-Za-z]*"))
        if((varDivBy > 1) & isTRUE(all.equal(log10(varDivBy) %% 1, 0))) 
            varLabel <- paste0(varLabel, "(", stringr::str_replace(varDivBy, pattern="1", replacement=""), "s)")
        else if (varDivBy != 1) varLabel <- paste0(varLabel, "(units of ", varDivBy, ")")
    }
    
    # Create plotTitle if passed as NULL
    if(is.null(plotTitle)) 
        plotTitle <- paste0(timeLabel, 
                            " coronavirus ", 
                            if(str_detect(stringr::str_to_upper(keyVar), pattern="CPM")) "cases" else "deaths",
                            " by county"
        )
    
    # Resolve includeStates and excludeStates
    if(length(includeStates) > 0) {
        if(length(excludeStates) > 0) {
            cat("\nParamater excludeStates will be ignored since includeStates was passed\n")
            excludeStates <- c()
        }
        invalidInclude <- setdiff(includeStates, c(state.abb, "DC"))
        if(length(invalidInclude) > 0) {
            cat("\nInvalid states passed in includeStates deleted:", paste(invalidInclude, collapse=", "), "\n")
            includeStates <-setdiff(includeStates, invalidInclude)
        }
    }
    
    # Create and return plot
    p1 <- df %>%
        colRenamer(vecRename=vecRename) %>%
        mutate(burden=pmax(pmin(get(all_of(keyVar)), varCeiling), varFloor)/varDivBy) %>%
        select(fips, burden, asofDate) %>%
        usmap::plot_usmap(regions="counties", data=., values="burden", include=includeStates, exclude=excludeStates) + 
        labs(title=plotTitle, 
             subtitle=if(varFloor > -Inf | varCeiling < +Inf) "Floors and/or ceilings applied" else NULL, 
             caption="Source: USA Facts"
        ) +
        scale_fill_continuous(paste0(varLabel, "\n", timeLabel), low="grey", high="red") +
        facet_wrap(~asofDate) + 
        theme(legend.position="bottom")
    p1
    
}

postProcessCountyData <- function(lstCtyBurden,
                                  lstCtyVax,
                                  lstState, 
                                  maxDate=NULL, 
                                  minDateBurden="2020-02-15", 
                                  minDateVax="2021-04-01", 
                                  includeStates=c(), 
                                  excludeStates=c()
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # lstCtyBurden: list of processed county-level burden data (or a dfPerCapita file from this list)
    # lstCtyVax: list of processed county-level vaccines data (or a vaxFix file from this list)
    # lstState: list of processed state-level burden data (or a dfPerCapita file from this list)
    # maxDate: maximum date to use for plotting (NULL means latest date in both lstCty and lstState)
    # minDateBurden: earliest date for scoring burden similarity across files
    # minDateVax: earliest date for scoring vaccine similarity across files
    # includeStates: specific states to be plotted, where c() means all states
    # excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
    
    # Extract the relevant perCapita and vaxFix data if needed
    if("list" %in% class(lstCtyBurden)) lstCtyBurden <- lstCtyBurden[["dfPerCapita"]]
    if("list" %in% class(lstState)) lstState <- lstState[["dfPerCapita"]]
    if("list" %in% class(lstCtyVax)) lstCtyVax <- lstCtyVax[["vaxFix"]]
    
    # Get maxDate if not provided
    if(is.null(maxDate)) maxDate <- min(max(lstCtyBurden$date, na.rm=TRUE), max(lstState$date, na.rm=TRUE))
    cat("\nParameter maxDate is:", as.character(maxDate), "\n\n")
    
    # Data for score similarity process
    dfCompare <- compareStateSummedCounty(dfState=lstState, 
                                          dfCounty=lstCtyBurden, 
                                          inclStates=c(state.abb, "DC"), 
                                          dateThru=maxDate, 
                                          makePlot=FALSE,
                                          returnData=TRUE
    )
    scoreSimilarity(dfCompare, minDate=minDateBurden, maxDate=maxDate, makeFacet=FALSE)
    
    # Check differences in data sources
    dfAllState <- integrateStateVaccine(lstCtyVax, statePerCap=lstState, treatNAZero=TRUE)
    vaxDiff <- scoreVaxSimilarity(dfAllState, minDate=minDateVax, maxDate=maxDate, returnBaseData=TRUE)
    
    # Create county-level burden data by quarters
    dfRoll91 <- createBurdenCountyDate(lstCtyBurden, 
                                       maxDate=maxDate, 
                                       rollBy=months(c(0, -3, -6, -9)), 
                                       dateSpan=91
    )
    makeBurdenDatePlot(dfRoll91, 
                       keyVar="cpm91", 
                       timeLabel="3-month", 
                       varCeiling=100000, 
                       varDivBy=1000, 
                       includeStates=includeStates, 
                       excludeStates=excludeStates
                       ) %>% 
        print()
    makeBurdenDatePlot(dfRoll91, 
                       keyVar="dpm91", 
                       timeLabel="3-month", 
                       varCeiling=1500, 
                       includeStates=includeStates, 
                       excludeStates=excludeStates
                       ) %>% 
        print()
    
    # Return the key elements
    list(dfCompare=dfCompare, dfAllState=dfAllState, vaxDiff=vaxDiff, dfRoll91=dfRoll91)
    
}

County-level data are post-processed:

cty_postdata_20221010_v2 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221010$dfPerCapita, 
                                                  lstCtyVax=cty_vaxdata_20221011$vaxFix, 
                                                  lstState=readFromRDS("cdc_daily_221002")$dfPerCapita, 
                                                  excludeStates="AK"
                                                  )
## 
## Parameter maxDate is: 2022-10-01

# Check equivalence of data files
identical(cty_postdata_20221010, cty_postdata_20221010_v2)
## [1] TRUE

Additional post-processing steps are run:

# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221010, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).

# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221010, 
                            p1CompareStates=c("GA", "FL", "NE"), 
                            p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"), 
                            p3VaxTimes=sort(c("2022-01-01", "2022-09-28")),
                            p4DF=cty_newdata_20221010$dfPerCapita
                            )
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 1 row(s) containing missing values (geom_path).

The function is updated to allow for excluding states (such as AK in this case where Valdez County issues in usmap cause an NA issue):

additionalCountyPostProcess <- function(lstPost, 
                                        p1CompareStates=c(), 
                                        p1AggData=FALSE, 
                                        p2VaxStates=c(), 
                                        p3VaxTimes=c(), 
                                        p4DF=NULL,
                                        p4MaxDate=NULL, 
                                        p4RollBy=months(c(0, -1, -2, -3)),
                                        p4DateSpan=28, 
                                        p4CPMCeiling=50000, 
                                        p4DPMCeiling=500, 
                                        includeStates=c(), 
                                        excludeStates=c()
                                        ) {
    
    # FUNCTION ARGUMENTS:
    # lstPost: list of post-processed county data
    # p1CompareStates: states that should be compared vs. summed county
    # p1AggData: boolean, should the comparison states all be aggregated to a single comparison?
    # p2VaxStates: states that should be compared for vaccine evolution
    # p3VaxTimes: character vector of form c(minDate, maxDate) for time period to score vaccine similarity
    # p4DF: data frame for creating rolling data (NULL means do not run)
    # p4MaxDate: maximum date for rolling analysis (NULL means use maximum date in p4DF minus 1 day)
    # p4RollBy: time periods to roll back for analysis
    # p4DateSpan: size of windows for rolling analysis
    # p4CPMCeiling: ceiling for plots on CPM (all values at or above this will be the same color)
    # p4DPMCeiling: ceiling for plots on DPM (all values at or above this will be the same color)
    # includeStates: specific states to be plotted, where c() means all states
    # excludeStates: specific states NOT to be plotted (ignored if includeStates has length > 0)
    
    # 1. Plotting state vs. summed county for key states
    if(length(p1CompareStates) > 0) {
        compareStateSummedCounty(lstAll=lstPost[["dfCompare"]], 
                                 inclStates=p1CompareStates, 
                                 createData=FALSE, 
                                 aggData=p1AggData
        )
    }
    
    # 2. Plot differences in vaccines data if needed
    if(length(p2VaxStates) > 0) 
        stateAgeVaxEvolution(lstPost[["dfAllState"]], keyState=p2VaxStates)
    
    # 3. Check vaccine similarity scoring on a different time period
    if(length(p3VaxTimes) > 0) {
        if(length(p3VaxTimes) != 2 | p3VaxTimes[2] < p3VaxTimes[1]) 
            cat("\np3VaxTimes should be c(minDate, maxDate), skipping this step due to bad parameter\n")
        else 
            scoreVaxSimilarity(lstPost[["dfAllState"]], minDate=p3VaxTimes[1], maxDate=p3VaxTimes[2])
    }
    
    # 4. Additional rolling data as needed
    if(!is.null(p4DF)) {
        if(is.null(p4MaxDate)) p4MaxDate <- max(p4DF$date) - lubridate::days(1)
        dfRoll <- createBurdenCountyDate(p4DF, maxDate=p4MaxDate, rollBy=p4RollBy, dateSpan=p4DateSpan)
        makeBurdenDatePlot(dfRoll, 
                           keyVar=paste0("cpm", p4DateSpan), 
                           timeLabel=paste0(p4DateSpan, "-day"), 
                           varCeiling=p4CPMCeiling, 
                           varDivBy=1000, 
                           includeStates=includeStates, 
                           excludeStates=excludeStates
        ) %>%
            print()
        makeBurdenDatePlot(dfRoll, 
                           keyVar=paste0("dpm", p4DateSpan), 
                           timeLabel=paste0(p4DateSpan, "-day"), 
                           varCeiling=p4DPMCeiling, 
                           includeStates=includeStates, 
                           excludeStates=excludeStates
        ) %>%
            print()
    }
    
}

Additional post-processing steps are re-run:

# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221010, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).

# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221010, 
                            p1CompareStates=c("GA", "FL", "NE"), 
                            p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"), 
                            p3VaxTimes=sort(c("2022-01-01", "2022-09-28")),
                            p4DF=cty_newdata_20221010$dfPerCapita, 
                            excludeStates=c("AK")
                            )
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 1 row(s) containing missing values (geom_path).

Restatement issues are explore:

# Check for states
dfTest_20221010_state <- findDeltaFromMax(cty_newdata_20221010$dfPerCapita, 
                                          groupBy="state", 
                                          numVar=c("cases", "deaths")
                                          )

# Check for counties
dfTest_20221010_cty <- findDeltaFromMax(cty_newdata_20221010$dfPerCapita, 
                                        groupBy=c("countyFIPS", "state"), 
                                        numVar=c("cases", "deaths")
                                        )

# Explore state differences
plotDeltaFromMax(dfTest_20221010_state, plotStateRatio=TRUE)

plotDeltaFromMax(dfTest_20221010_state,
                 dfCtyData=cty_newdata_20221010$dfPerCapita,
                 plotDeathStates=c("IL", "MA", "NE"),
                 plotCaseStates=c("IL", "WY")
                 )

Counties are also explored for restatement, using a combination of value and ratio:

dfRatio_cty <- dfTest_20221010_cty %>% 
    group_by(countyFIPS, state, name) %>% 
    filter(delMax==min(delMax)) %>% 
    filter(date==max(date)) %>% 
    ungroup()
dfRatio_cty
## # A tibble: 6,284 × 8
##    countyFIPS state date       name   value ratMax cumMax    delMax
##    <chr>      <chr> <date>     <chr>  <dbl>  <dbl>  <dbl>     <dbl>
##  1 01001      AL    2022-04-19 cases  15755  0.855  0.999 -0.000869
##  2 01001      AL    2022-03-27 deaths   209  0.917  0.995 -0.00439 
##  3 01003      AL    2022-04-19 cases  55564  0.845  1.00  -0.000228
##  4 01003      AL    2021-04-14 deaths   300  0.420  0.997 -0.00140 
##  5 01005      AL    2021-06-27 cases   2344  0.339  0.999 -0.000289
##  6 01005      AL    2021-05-19 deaths    56  0.544  0.982 -0.00971 
##  7 01007      AL    2021-04-14 cases   2559  0.340  0.998 -0.000663
##  8 01007      AL    2021-04-12 deaths    58  0.537  0.967 -0.0185  
##  9 01009      AL    2021-11-11 cases  10494  0.616  0.995 -0.00317 
## 10 01009      AL    2021-04-22 deaths   133  0.516  0.985 -0.00775 
## # … with 6,274 more rows
## # ℹ Use `print(n = ...)` to see more rows
# Explore ratios of deaths by county - IL/MA/NE vs. others
dfRatio_cty %>%
    filter(name=="deaths", value > 0) %>%
    mutate(type=case_when(state=="IL" ~ "IL", state %in% c("MA", "NE") ~ "MA/NE", state=="WY" ~ "WY", TRUE ~ "Other"), 
           rat=1+delMax
           ) %>%
    ggplot(aes(x=rat)) + 
    geom_histogram(aes(fill=type)) + 
    facet_wrap(~type, nrow=2, scales="free_y") + 
    labs(x="Lowest ratio of (deaths-cummax(deaths))/max(deaths) by county", 
         y="# Counties", 
         title="Death ratios by county (1.0 means strictly non-descending, 0.0 means reports 0 after max)"
         )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Explore ratios of cases by county - IL/WY vs. others
dfRatio_cty %>%
    filter(name=="cases", value > 0) %>%
    mutate(type=case_when(state=="IL" ~ "IL", state %in% c("MA", "NE") ~ "MA/NE", state=="WY" ~ "WY", TRUE ~ "Other"), 
           rat=1+delMax
           ) %>%
    ggplot(aes(x=rat)) + 
    geom_histogram(aes(fill=type)) + 
    facet_wrap(~type, nrow=2, scales="free_y") + 
    labs(x="Lowest ratio of (cases-cummax(cases))/max(cases) by county", 
         y="# Counties", 
         title="Case ratios by county (1.0 means strictly non-descending, 0.0 means reports 0 after max)"
         )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Case data generally has more restatement than deaths data. IL, MA, and NE all have significant county-level changes for deaths, while WY has the most counties reporting large changes in cases

A sample of counties with high death restatements are plotted:

dfRatio_cty %>% 
    filter(name=="deaths", delMax < -0.01, delMax*value < -100) %>%
    select(countyFIPS) %>%
    left_join(cty_newdata_20221010$dfPerCapita, by="countyFIPS") %>%
    ggplot(aes(x=date, y=deaths)) + 
    geom_line() + 
    facet_wrap(~paste0(countyFIPS, " (", state, ")"), scales="free_y")

The issue in IL is driven almost exclusively by 17031, which restated away around 50% of deaths. The issue in MA appears to be consistent across counties, with the restated data somewhat close to on trend prior to the spike. Two counties in TX and NJ have an erroneous spike, while one county in TX (48141) restated away almost all deaths

A sample of counties with high case restatements are plotted:

dfRatio_cty %>% 
    filter(name=="cases", delMax < -0.01, delMax*value < -2000) %>%
    select(countyFIPS) %>%
    left_join(cty_newdata_20221010$dfPerCapita, by="countyFIPS") %>%
    ggplot(aes(x=date, y=cases)) + 
    geom_line() + 
    facet_wrap(~paste0(countyFIPS, " (", state, ")"), scales="free_y")

Many counties with high case restatements appear to mainly follow trend at a glance. Significant off-trend deviations occur in 17031 (IL), 25007 (MA), 48139 (TX), 48141 (TX), 48189 (TX), and 48279 (TX). The deviations in Wyoming appear to be one-time issues.

Data with and without anomalous counties can then be plotted:

anom_cty <- dfRatio_cty %>% 
    filter(name=="cases") %>%
    mutate(bigPct=(delMax < -0.05), bigVol=(delMax*value < -2000))
anom_cty %>%
    count(bigPct, bigVol)
## # A tibble: 4 × 3
##   bigPct bigVol     n
##   <lgl>  <lgl>  <int>
## 1 FALSE  FALSE   3011
## 2 FALSE  TRUE       6
## 3 TRUE   FALSE    113
## 4 TRUE   TRUE      12
anom_cty %>%
    select(countyFIPS, state, bigPct, bigVol) %>%
    inner_join(cty_newdata_20221010$dfPerCapita, by=c("state", "countyFIPS")) %>%
    mutate(type=case_when(bigVol ~ "2) Big volume change", 
                          bigPct ~ "3) Big percent change", 
                          TRUE ~ "1) Low/no percent change"
                          )
           ) %>%
    group_by(type, date) %>%
    summarize(across(c(cases, deaths), sum), .groups="drop") %>%
    ggplot(aes(x=date, y=cases)) + 
    geom_line() + 
    facet_wrap(~type, scales="free_y") + 
    labs(title="Sum of cases by county type", 
         x=NULL, 
         y="Sum of cases", 
         subtitle="Big volume change is at least 2000 cases, big percentage change is at least 5% (but not 2000 cases)"
         )

Case data appear reasonable after anomalies are removed. The large percentage changes make some impact but appear to generally average out and remain on trend. The large volume changes appear to be anomalies that impact trends, especially in the most recent time periods.

The process is converted to functional form:

plotByRestatement <- function(dfRatio, 
                              dfBurden, 
                              idVars=c("countyFIPS", "state"),
                              keyMetric="cases", 
                              keyMetricBurden=keyMetric, 
                              delMaxHurdle=0,
                              bigVolHurdle=0, 
                              returnData=FALSE
                              ) {
    
    # FUNCTION ARGUMENTS:
    # dfRatio: file containing key metrics for determining restatement status
    # dfBurden: the burden data by geography
    # idVars: variables that identify a geographical unit
    # keyMetric: name of variable in the ratio file
    # keyMetricBurden: name of variable in the burden file
    # delMaxHurdle: cutoff for determining large delMax
    # bigVolHurdle: cutoff for determining large delMax*value
    # returnData: boolean, should dfSeg be returned?
    
    # Create segment data
    dfSeg <- dfRatio %>% 
        filter(name==keyMetric) %>%
        mutate(bigPct=(delMax < delMaxHurdle), bigVol=(delMax*value < bigVolHurdle))
    
    # Report on segment data
    dfSeg %>%
        count(bigPct, bigVol) %>%
        print()

    # Create and print plot
    p1 <- dfSeg %>%
        select(c(all_of(idVars), "bigPct", "bigVol")) %>%
        inner_join(dfBurden, by=all_of(idVars)) %>%
        mutate(type=case_when(bigVol ~ "2) Big volume change", 
                              bigPct ~ "3) Big percent change", 
                              TRUE ~ "1) Low/no percent change"
                              )
               ) %>%
        group_by(type, date) %>%
        summarize(across(all_of(keyMetricBurden), sum), .groups="drop") %>%
        ggplot(aes_string(x="date", y=keyMetricBurden[1])) + 
        geom_line() + 
        facet_wrap(~type, scales="free_y") + 
        labs(title="Sum of cases by segment", 
             x=NULL, 
             y="Sum of cases", 
             subtitle=paste0("Big volume change is more than ", 
                             -bigVolHurdle, 
                             " ", 
                             keyMetricBurden[1],  
                             ", big percentage change is at least ", 
                             round(-100*delMaxHurdle), 
                             "% (but not ", 
                             -bigVolHurdle, 
                             " ", 
                             keyMetricBurden[1], 
                             ")"
                             )
             )
    print(p1)

    # Return data if requested
    if(isTRUE(returnData)) return(dfSeg)
    
}


dfSegTest <- plotByRestatement(dfRatio_cty, 
                               dfBurden=cty_newdata_20221010$dfPerCapita, 
                               keyMetric="cases", 
                               delMaxHurdle=-0.05, 
                               bigVolHurdle=-2000, 
                               returnData=TRUE
                               )
## # A tibble: 4 × 3
##   bigPct bigVol     n
##   <lgl>  <lgl>  <int>
## 1 FALSE  FALSE   3011
## 2 FALSE  TRUE       6
## 3 TRUE   FALSE    113
## 4 TRUE   TRUE      12

identical(anom_cty, dfSegTest)
## [1] TRUE

A function is also written to create the main restatement dataset:

createRestatementData <- function(df, 
                                  geoType="county",
                                  idVars=NULL, 
                                  numVars=c("cases", "deaths"),
                                  varPeak="delMax",
                                  fnPeak=min,
                                  makeIntermediate=TRUE,
                                  returnIntermediate=FALSE
                                  ) {

    # FUNCTION ARGUMENTS:
    # df: data frame containing relevant per-capita data OR previously made dfIntermediate
    # geoType: the type of data (only "county" and "state" are supported)
    # idVars: the ID variables for a geographical unit in df (NULL means infer from geoType)
    # numVars: numeric variables for calculating restatement amounts
    # varPeak: variable to be used for calculating whether restatement occurred
    # fnPeak: function to be applied to determine where 'most' restatement occurred
    # makeIntermediate: boolean, should dfIntermediate be created? Otherwise, df is treated as dfIntermediate
    # returnIntermediate: boolean, should the process be stopped and just dfIntermediate returned?
    
    # Infer idVars if passed as NULL
    if(is.null(idVars)) {
        if(geoType=="county") idVars <- c("countyFIPS", "state")
        else if(geoType=="state") idVars <- c("state")
        else error("\nFunction can only infer for geoType of 'county' or 'state'\n")
    }
    
    # Create intermediate data frame using findDeltaFromMax() OR as passed
    if(isTRUE(makeIntermediate)) df <- findDeltaFromMax(df, groupBy=idVars, numVar=numVars)
    
    # If returnIntermediate is TRUE, return that frame and stop the function
    if(isTRUE(returnIntermediate)) return(df)
    
    # Create and return the ratio data otherwise
    df %>%
        group_by_at(all_of(c(idVars, "name"))) %>%
        filter(get(varPeak)==fnPeak(get(varPeak))) %>%
        filter(date==max(date)) %>%
        ungroup()

}

# Check data creation process for counties
dfInter_cty <- createRestatementData(cty_newdata_20221010$dfPerCapita, 
                                     geoType="county", 
                                     makeIntermediate=TRUE,
                                     returnIntermediate=TRUE
                                     )
identical(dfInter_cty, dfTest_20221010_cty)
## [1] TRUE
# Check data creation process for states
dfInter_state <- createRestatementData(cty_newdata_20221010$dfPerCapita,
                                       geoType="state",
                                       makeIntermediate=TRUE,
                                       returnIntermediate=TRUE
                                       )
identical(dfInter_state, dfTest_20221010_state)
## [1] TRUE
dfRatio_cty_chk <- createRestatementData(dfInter_cty, 
                                         geoType="county", 
                                         makeIntermediate=FALSE,
                                         returnIntermediate=FALSE
                                         )
identical(dfRatio_cty_chk, dfRatio_cty)
## [1] TRUE
dfRatio_state <- createRestatementData(dfInter_state, 
                                       geoType="state", 
                                       makeIntermediate=FALSE,
                                       returnIntermediate=FALSE
                                       )
dfRatio_state
## # A tibble: 102 × 7
##    state date       name     value     ratMax cumMax     delMax
##    <chr> <date>     <chr>    <dbl>      <dbl>  <dbl>      <dbl>
##  1 AK    2021-06-14 cases    68383 0.251      0.982  -0.00448  
##  2 AK    2021-08-21 deaths     382 0.292      0.972  -0.00842  
##  3 AL    2022-04-19 cases  1298473 0.851      1.00   -0.000239 
##  4 AL    2021-04-09 deaths   10686 0.522      1.00   -0.000244 
##  5 AR    2022-10-06 cases   921985 1          1       0        
##  6 AR    2021-04-01 deaths    5636 0.460      0.994  -0.00269  
##  7 AZ    2022-10-06 cases  2275235 1          1       0        
##  8 AZ    2022-03-08 deaths   27708 0.928      0.991  -0.00797  
##  9 CA    2020-02-02 cases       27 0.00000259 0.0327 -0.0000767
## 10 CA    2021-12-27 deaths   75398 0.794      1.00   -0.000347 
## # … with 92 more rows
## # ℹ Use `print(n = ...)` to see more rows

Ratio data can then be plotted:

dfRatio_state %>% 
    ggplot(aes(x=fct_reorder(state, delMax), y=delMax)) + 
    geom_col() + 
    geom_text(aes(label=paste0(round(100*delMax, 0), "%")), hjust=1, size=3) + 
    coord_flip() + 
    facet_wrap(~name) + 
    labs(title="Maximum downward restatement from previous peak vs. maximum value for state", 
         x=NULL, 
         y=NULL, 
         subtitle="Data through September 2022"
         )

dfRatio_cty %>%
    filter(name=="deaths") %>%
    top_n(25, -delMax*value) %>%
    mutate(loss=-delMax*value, lab=paste0(countyFIPS, " (", state, ")")) %>%
    select(lab, loss, delMax) %>%
    pivot_longer(-lab) %>%
    mutate(textLab=ifelse(name=="delMax", paste0(round(value*100, 0), "%"), round(value, 0))) %>%
    ggplot(aes(x=fct_reorder(lab, value), y=value)) + 
    geom_text(aes(label=textLab, hjust=ifelse(name=="delMax", 1, 0))) +
    geom_col(fill="lightblue") + 
    coord_flip() +
    facet_wrap(~c("loss"="Magnitude of restatement", "delMax"="Percent lost")[name], scales="free_x") + 
    labs(x=NULL, y=NULL, title="Counties with significant death restatements", subtitle="Data as of Sep 2022")

dfRatio_cty %>%
    filter(name=="cases") %>%
    top_n(25, -delMax*value) %>%
    mutate(loss=-delMax*value, lab=paste0(countyFIPS, " (", state, ")")) %>%
    select(lab, loss, delMax) %>%
    pivot_longer(-lab) %>%
    mutate(textLab=ifelse(name=="delMax", paste0(round(value*100, 0), "%"), round(value, 0))) %>%
    ggplot(aes(x=fct_reorder(lab, value), y=value)) + 
    geom_text(aes(label=textLab, hjust=ifelse(name=="delMax", 1, 0))) +
    geom_col(fill="lightblue") + 
    coord_flip() +
    facet_wrap(~c("loss"="Magnitude of restatement", "delMax"="Percent lost")[name], scales="free_x") + 
    labs(x=NULL, y=NULL, title="Counties with significant case restatements", subtitle="Data as of Sep 2022")

As observed previously, most states have at most modest restatement in the USAF data

The latest county-level burden data are downloaded:

readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221108.csv", 
                 "usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221108.csv"
                 )
compareList <- list("usafCase"=readFromRDS("cty_newdata_20221010")$dfRaw$usafCase, 
                    "usafDeath"=readFromRDS("cty_newdata_20221010")$dfRaw$usafDeath
                    )

# Use existing clusters
cty_newdata_20221108 <- readRunUSAFacts(maxDate="2022-11-06", 
                                        downloadTo=lapply(readList, 
                                                          FUN=function(x) if(file.exists(x)) NA else x
                                                          ),
                                        readFrom=readList, 
                                        compareFile=compareList, 
                                        writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log", 
                                        ovrwriteLog=TRUE,
                                        useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
                                        skipAssessmentPlots=FALSE,
                                        brewPalette="Paired"
                                        )
## Rows: 3193 Columns: 1020
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (3): County Name, State, StateFIPS
## dbl (1017): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 28
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 89 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## Rows: 3193 Columns: 1020
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (3): County Name, State, StateFIPS
## dbl (1017): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 28
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 2 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log

## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 38 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221108_chk_v005.log
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType    cases     new_cases            n
##   <chr>     <dbl>         <dbl>        <dbl>
## 1 before 3.99e+10 93706741      3244088     
## 2 after  3.95e+10 91519270      3192272     
## 3 pctchg 9.35e- 3        0.0233       0.0160
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType  deaths   new_deaths            n
##   <chr>    <dbl>        <dbl>        <dbl>
## 1 before 5.72e+8 1054379      3244088     
## 2 after  5.52e+8  979772      3192272     
## 3 pctchg 3.63e-2       0.0708       0.0160

## NULL

# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221108$useClusters, 
                       caption="Includes only counties with 25k+ population",
                       brewPalette="viridis"
                       )

# Save the refreshed file
saveToRDS(cty_newdata_20221108, ovrWriteError=FALSE)

Vaccines data are also updated, though the process will need to integrate previous data:

cty_vaxdata_20221109 <- processCountyVaccines(loc="./RInputFiles/Coronavirus/county_vaccine_20221109.csv", 
                                              ctyList=cty_newdata_20221108, 
                                              minDateCD=c("2022-06-09", "2022-06-09"),
                                              maxDateCD="2022-10-29"
                                              )
## Rows: 421046 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Date, FIPS, Recip_County, Recip_State, SVI_CTGY, Metro_status
## dbl (66): MMWR_week, Completeness_pct, Administered_Dose1_Recip, Administere...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## Records from other than 50 states and DC:
## # A tibble: 9 × 2
##   state     n
##   <chr> <int>
## 1 AS      129
## 2 FM      129
## 3 GU      256
## 4 MH      128
## 5 MP      128
## 6 PR    10127
## 7 PW      128
## 8 VI      512
## 9 <NA>     79

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## Warning: Removed 16 rows containing non-finite values (stat_boxplot).

## 
## Count of NA records by column
##           state            FIPS popgte65_minpop popgte65_maxpop    popgte65_nnA 
##               0               0               0               0               0 
##               n 
##               0 
## 
## Records where minimum and maximum population differ# A tibble: 0 × 5
## # … with 5 variables: state <chr>, FIPS <chr>, age <chr>, minpop <dbl>,
## #   maxpop <dbl>
## # ℹ Use `colnames()` to see all variable names
## 
## 
## 
## Will run with parameters:
## burdenVar: cpm dpm 
## vaxVar: vxcpoppct vxcpoppct 
## minDateCD: 2022-06-09 2022-06-09 
## maxDateCD: 2022-10-29 2022-10-29
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -295362078   -1799774      81846    2125375   66057276 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25179.46    2367.67  10.635   <2e-16 ***
## vaxMetric      46.37      36.60   1.267    0.205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7975000 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.0005136,  Adjusted R-squared:  0.0001936 
## F-statistic: 1.605 on 1 and 3124 DF,  p-value: 0.2053
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##        Min         1Q     Median         3Q        Max 
## -294332768   -1853432      41157    2093372   68302435 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                16012.93    8599.68   1.862 0.062692 .  
## type>500k               29996.85    5069.49   5.917 3.63e-09 ***
## type100k-500k           21017.91    5053.15   4.159 3.28e-05 ***
## type25k-100k            21759.38    5698.99   3.818 0.000137 ***
## vaxMetric:type<25k        238.15     173.16   1.375 0.169130    
## vaxMetric:type>500k       -25.74      71.85  -0.358 0.720217    
## vaxMetric:type100k-500k   115.45      81.17   1.422 0.155015    
## vaxMetric:type25k-100k    122.80     107.56   1.142 0.253656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7977000 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.5659, Adjusted R-squared:  0.5648 
## F-statistic:   508 on 8 and 3118 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Removed 16 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric, data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3581081   -16937    -1522    23241   702906 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 380.7127    30.9273   12.31   <2e-16 ***
## vaxMetric    -4.5654     0.4781   -9.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104200 on 3124 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.02837,    Adjusted R-squared:  0.02806 
## F-statistic:  91.2 on 1 and 3124 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = get(burdenVar) ~ vaxMetric * type + 0 - vaxMetric, 
##     data = dfReg, weights = pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -3530993   -20262    -5986    17362   719666 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## type<25k                204.9955   111.8143   1.833 0.066845 .  
## type>500k               392.6450    65.9142   5.957 2.86e-09 ***
## type100k-500k           178.3898    65.7018   2.715 0.006661 ** 
## type25k-100k            275.7474    74.0991   3.721 0.000202 ***
## vaxMetric:type<25k       -0.2408     2.2514  -0.107 0.914821    
## vaxMetric:type>500k      -5.0306     0.9342  -5.385 7.78e-08 ***
## vaxMetric:type100k-500k  -1.0842     1.0554  -1.027 0.304357    
## vaxMetric:type25k-100k   -1.9145     1.3985  -1.369 0.171090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103700 on 3118 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.107,  Adjusted R-squared:  0.1047 
## F-statistic: 46.69 on 8 and 3118 DF,  p-value: < 2.2e-16
# Save the refreshed file
saveToRDS(cty_vaxdata_20221109, ovrWriteError=FALSE)

County-level data are post-processed:

cty_postdata_20221108 <- postProcessCountyData(lstCtyBurden=cty_newdata_20221108$dfPerCapita, 
                                               lstCtyVax=cty_vaxdata_20221109$vaxFix, 
                                               lstState=readFromRDS("cdc_daily_221102")$dfPerCapita, 
                                               excludeStates="AK"
                                               )
## 
## Parameter maxDate is: 2022-11-01

Additional post-processing steps are re-run:

# Step 1a: Burden comparisons for aggregated states
additionalCountyPostProcess(cty_postdata_20221108, p1CompareStates=c(state.abb, "DC"), p1AggData=TRUE)
## Warning: Removed 6 row(s) containing missing values (geom_path).

# Step 1: Burden aggregation for key states
# Step 2: vaccine comparisons
# Step 3: Scoring updates (and errors)
# Step 4: New rolling data (28-day default with ceilings 50000 CPM, 500 DPM)
additionalCountyPostProcess(cty_postdata_20221108, 
                            p1CompareStates=c("GA", "FL", "NE"), 
                            p2VaxStates=c("MA", "HI", "TX", "VA", "VT", "GA", "CO", "SD"), 
                            p3VaxTimes=sort(c("2022-01-01", "2022-10-26")),
                            p4DF=cty_newdata_20221108$dfPerCapita, 
                            excludeStates=c("AK")
                            )
## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 6 row(s) containing missing values (geom_path).

## Warning: Removed 1 row(s) containing missing values (geom_path).

Post-processing will need to be updated to account for changes in data availability in the CDC data file

Memory usage is explored:

# List of files
sapply(ls(), FUN=function(x) object.size(get(x))) %>% sort(decreasing=FALSE)
##               usafUpdatedURL                  usafMainURL 
##                          168                          232 
##             rawMakeVarMapper                problemStates 
##                          592                          624 
##                     readList                       zeroNA 
##                          832                          840 
##               fullListMapper                     zeroPad2 
##                          848                          896 
##                     zeroPad5                  pivotMapper 
##                          896                         1024 
##      checkControlGroupMapper         plotSimilarityMapper 
##                         1056                         1056 
##                  glimpseFile       checkControlVarsMapper 
##                         1064                         1344 
##                     fileRead                     uqMapper 
##                         1344                         1400 
##                 perCapMapper                    urlMapper 
##                         1408                         1488 
##             lstExcludeMapper                 combineFiles 
##                         1544                         2072 
##              vecSelectMapper                  colSelector 
##                         2160                         2184 
##                   colRenamer                   glimpseLog 
##                         2632                         2688 
##                      zeroPad                 customYYYYMM 
##                         2744                         2856 
##                  vecToTibble                    useStates 
##                         3192                         3248 
##                    genNewLog                sumImputedHHS 
##                         3272                         3368 
##                    renMapper             helperRollingAgg 
##                         3488                         3584 
##               lstComboMapper                    pivotData 
##                         3624                         4256 
##                 fileDownload                    skinnyHHS 
##                         4264                         4272 
##       createBurdenCountyDate          postProcessCDCDaily 
##                         4480                         4480 
##        processCountyVaccines                     printLog 
##                         4496                         4856 
##                   joinFrames                 getStateData 
##                         5040                         5272 
##                 lagCorrCheck              checkUniqueRows 
##                         5488                         5536 
##              helperPerCapita                    rowFilter 
##                         5824                         6216 
##                   colMutater                dfMinMaxState 
##                         6272                         6536 
##             dfMinMaxState_v2                  getClusters 
##                         6536                         6832 
##        checkSimilarityMapper               onePageCFRPlot 
##                         6912                         6928 
##            getCountyClusters                    saveToRDS 
##                         7280                         7608 
##              lstFilterMapper                  specSumProd 
##                         7800                         7896 
##               helperLinePlot              clustersToFrame 
##                         8152                         8744 
##                       specNA          helperMakePerCapita 
##                         8888                         9016 
##                 checkControl               createGroupAgg 
##                         9192                         9296 
##                getCountyData                 testImputeNA 
##                         9608                         9632 
##                integrateData                 tmpVaxCounts 
##                        10304                        10360 
##                dfRatio_state              createPerCapita 
##                        10504                        10512 
##               combineAggData             imputeNACapacity 
##                        10864                        11016 
##         pivotStateBurdenData             helperSimilarity 
##                        11120                        11448 
##               plotSimilarity                    findPeaks 
##                        11488                        11984 
##          createVaxBurdenData        integrateStateVaccine 
##                        12496                        12768 
##            makeBurdenSummary                  combineRows 
##                        13736                        13776 
## createSummedCountyBurdenData            makeCaseHospDeath 
##                        13968                        14560 
##              checkSimilarity              clusterCounties 
##                        14736                        15072 
##               selfListMapper               helperAggTrend 
##                        15952                        17232 
##             diagnoseClusters               processRawFile 
##                        17472                        18480 
##               flagLargeDelta         cumulativeBurdenPlot 
##                        18992                        19728 
##               helperAggTotal                tempStackPlot 
##                        20448                        20736 
##     hospitalCapacityCDCDaily           scoreVaxSimilarity 
##                        20792                        20968 
##       sparseCountyClusterMap                  readFromRDS 
##                        21072                        21232 
##         stateAgeVaxEvolution          repairVaxPopulation 
##                        21560                        22264 
##       downloadCountyVaccines                    tmpVaxSum 
##                        22528                        24872 
##                 createGeoMap                corrVaxBurden 
##                        24880                        25096 
##                  helperElbow                 keyAggMapper 
##                        25592                        26624 
##     downloadReadHospitalData            plotVaxBurdenData 
##                        27312                        27368 
##       plotCombineAggByMapper             compareAggregate 
##                        28552                        29776 
##     compareStateSummedCounty            filterPopStateAge 
##                        30616                        31616 
##             hospAgePerCapita              scoreSimilarity 
##                        31632                        32152 
##                   plotCFRLag             helperSummaryMap 
##                        32648                        32800 
##                createSummary        postProcessCountyData 
##                        33424                        33592 
##            readQCRawCDCDaily                findCorrAlign 
##                        33816                        35104 
##              readRunCDCDaily        cumulativeVaccinePlot 
##                        35624                        36752 
##              readRunUSAFacts  additionalCountyPostProcess 
##                        38008                        38080 
##      plotHospitalUtilization                   countyCorr 
##                        38224                        39232 
##                readQCRawUSAF              readPopStateAge 
##                        42776                        43456 
##            createBurdenPivot                   tempGetVax 
##                        44824                        45496 
##           peakValleyCDCDaily               makePeakValley 
##                        46208                        50344 
##           makeBurdenDatePlot                clusterStates 
##                        52216                        53696 
##            bucketPopStateAge      createDetailedSummaries 
##                        54760                        77552 
##        createRestatementData             findDeltaFromMax 
##                       102720                       189520 
##            plotByRestatement                     keyRatio 
##                       230688                       280600 
##                    decStatus                     dfMinMax 
##                       305872                       327368 
##             plotDeltaFromMax                     anom_cty 
##                       402320                       407360 
##                    dfSegTest                  dfRatio_cty 
##                       407360                       583032 
##              dfRatio_cty_chk                tmpCountyList 
##                       583032                       591032 
##            keyRatioDateState         keyRatioDateState_v2 
##                      4695024                      4695024 
##                       dfTest                dfInter_state 
##                      5476888                      5648248 
##        dfTest_20221010_state         cty_vaxdata_20220914 
##                      5648248                     14110816 
##        cty_postdata_20220913        cty_postdata_20221010 
##                     21262688                     22494560 
##     cty_postdata_20221010_v2        cty_postdata_20221108 
##                     22494560                     23122616 
##         cty_vaxdata_20221109         cty_vaxdata_20221011 
##                     67665856                     71723936 
##                 keyRatioDate                  compareList 
##                    337304784                    353948352 
##                    dfTest_v2                 tmpVaxBurden 
##                    385465464                    390902824 
##                  dfInter_cty          dfTest_20221010_cty 
##                    397530744                    397530744 
##         cty_newdata_20220913         cty_newdata_20221010 
##                   1351507784                   1393594904 
##         cty_newdata_20221108 
##                   1432876216
# Function for deleting files and checking memory impact
cleanMem <- function(objs=c(), runGC=TRUE, delObjs=FALSE) {
    
    # FUNCTION ARGUMENTS:
    # objs: character vector of objects to be removed
    # runGC: boolean, should gc() be run before and after deletion?
    # delObjs: boolean, should the objects be deleted?
    
    # Run gc() if requested
    if(isTRUE(runGC)) {
        cat("\nMemory usage prior to deleting files:\n")
        print(gc())
    }
    
    # Remove objects that exist, if requested
    if(isTRUE(delObjs)) {
        objs <- objs[sapply(objs, FUN=function(x) exists(x)) %>% unname]
        rm(list=objs, inherits=TRUE)
    }
    
    # Run gc() if requested
    if(isTRUE(runGC)) {
        cat("\nMemory usage after deleting files:\n")
        print(gc())
    }
    
}

# Clean large objects
largeObjs <- c("cty_newdata_20220913", "cty_newdata_20221010", "dfTest_20221010_cty", "dfInter_cty")
cleanMem(largeObjs, delObjs=TRUE)
## 
## Memory usage prior to deleting files:
##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    1756502   93.9   13970685   746.2   28289914  1510.9
## Vcells 1012483775 7724.7 1641951356 12527.1 1367737348 10435.1
## 
## Memory usage after deleting files:
##             used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells   1518306   81.1   11176548   596.9   28289914  1510.9
## Vcells 456924338 3486.1 1313561085 10021.7 1367737348 10435.1

The latest county-level burden data are downloaded:

readList <- list("usafCase"="./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221210.csv", 
                 "usafDeath"="./RInputFiles/Coronavirus/covid_deaths_usafacts_downloaded_20221210.csv"
                 )
compareList <- list("usafCase"=readFromRDS("cty_newdata_20221108")$dfRaw$usafCase, 
                    "usafDeath"=readFromRDS("cty_newdata_20221108")$dfRaw$usafDeath
                    )

# Use existing clusters
cty_newdata_20221210 <- readRunUSAFacts(maxDate="2022-12-08", 
                                        downloadTo=lapply(readList, 
                                                          FUN=function(x) if(file.exists(x)) NA else x
                                                          ),
                                        readFrom=readList, 
                                        compareFile=compareList, 
                                        writeLog="./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log", 
                                        ovrwriteLog=TRUE,
                                        useClusters=readFromRDS("cty_newdata_20210813")$useClusters,
                                        skipAssessmentPlots=FALSE,
                                        brewPalette="Paired"
                                        )
## 
## No file has been downloaded, will use existing file: ./RInputFiles/Coronavirus/covid_confirmed_usafacts_downloaded_20221210.csv
## Rows: 3193 Columns: 1051
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (3): County Name, State, StateFIPS
## dbl (1048): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 31
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## Rows: 3193 Columns: 1051
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (3): County Name, State, StateFIPS
## dbl (1048): countyFIPS, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-25, 2020...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## 
## *** File has been checked for uniqueness by: countyFIPS countyName state stateFIPS 
## 
## 
## *** File has been checked for uniqueness by: countyFIPS stateFIPS date

## 
## 
## Checking for similarity of: column names
## In reference but not in current: 
## In current but not in reference: 
## 
## Checking for similarity of: date
## In reference but not in current: 0
## In current but not in reference: 31
## Detailed differences available in: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## 
## Checking for similarity of: county
## In reference but not in current: 
## In current but not in reference:

## 
## 
## ***Differences of at least 5 and at least 5%
## 
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log

## 
## 
## ***Differences of at least 0 and at least 0.1%
## 
## 0 records
## Detailed output available in log: ./RInputFiles/Coronavirus/USAF_NewData_20221210_chk_v005.log
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType    cases     new_cases            n
##   <chr>     <dbl>         <dbl>        <dbl>
## 1 before 4.28e+10 94747847      3343071     
## 2 after  4.24e+10 92580585      3289674     
## 3 pctchg 1.03e- 2        0.0229       0.0160
## 
## 
## Column sums before and after applying filtering rules:
## # A tibble: 3 × 4
##   isType  deaths   new_deaths            n
##   <chr>    <dbl>        <dbl>        <dbl>
## 1 before 6.05e+8 1062085      3343071     
## 2 after  5.82e+8  986026      3289674     
## 3 pctchg 3.82e-2       0.0716       0.0160

## NULL

# Plot all counties based on closest cluster
sparseCountyClusterMap(cty_newdata_20221210$useClusters, 
                       caption="Includes only counties with 25k+ population",
                       brewPalette="viridis"
                       )

# Save the refreshed file
saveToRDS(cty_newdata_20221210, ovrWriteError=FALSE)